Module 1 Assignment: Corporate Risk Narratives in Form 10-K

Building a Classical and Neural NLP Pipeline for Industry Analysis

Published: February 1, 2026

Modified: February 17, 2026

Important Reminder

  • This assignment may be completed using Google Colab or AWS Academy.
  • Your goal is to understand and implement the NLP pipeline, not to scale it.
  • The full pipeline cannot realistically be run on your local machine (Mac or PC). Please use Google Colab or request AWS Academy labs.
  • The assignment takes significant time to complete. Start early!

Submission Instructions

You must submit two files on Blackboard:

  1. Word Document (.docx)
    • Answers to all questions (Q1–Q4)
    • Business interpretation
    • Tables and figures
    • No more than 5 pages (excluding cover page and references)
    • No code in this document.
  2. Jupyter Notebook (.ipynb)
    • Fully executable
    • Clearly organized by Question (Q1–Q4)
    • Well-commented code
  3. You can follow the structure below for your solution notebook; feel free to adjust it as needed:
    1. config and paths
    2. load and parse 10k
    3. clean and chunk text
    4. feature engineering
    5. predictive models
    6. error analysis

Objective

Note

This assignment introduces the end-to-end NLP lifecycle using real corporate disclosures. You may use any generative AI tool, but you must still build and explain the pipeline yourself.

You will analyze 2024 Form 10-K filings to answer the following business question:

Important

Can corporate risk language be used to understand and predict industry-level risk exposure?

This assignment establishes the cleaned corpus, labels, and baselines that will be reused in Assignments 2–5, where you will build LLM-based and RAG pipelines.

Business Context

An investment firm is reassessing sector exposure amid increasing uncertainty related to:

  • regulation
  • technological disruption
  • operational risk
  • macroeconomic volatility

Rather than relying only on numerical indicators, the firm wants to understand how companies describe risk in their own words.

Each Form 10-K filing includes a Standard Industrial Classification (SIC) code, which will be used as a high-level industry label for analysis and prediction.

Data Source

You will use SEC-provided TXT versions of 2024 Form 10-K filings, available here:

SEC 2024 Form 10-K TXT Files

  • You will use all SEC-provided TXT versions of 2024 Form 10-K filings contained in the shared Google Drive folder.
  • All files in the folder must be processed programmatically.
  • You may later filter, subset, or group the data for analysis, but your NLP pipeline must be capable of ingesting the entire corpus.

2024 SEC 10-K Filings

Accessing the Files in Google Colab

Add Folder to Your Drive

  1. Open the link above.
  2. Right-click the folder → Add shortcut to Drive
  3. Save it in My Drive

Mount Drive in Colab

from google.colab import drive
drive.mount('/content/drive')

Verify Files

import os
os.listdir("/content/drive/MyDrive")

Adjust paths as needed.
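Once Drive is mounted, the filings can be enumerated programmatically. A minimal sketch (the path passed to `list_filings` is a placeholder; substitute the shared folder's actual location in your Drive):

```python
import os

def list_filings(data_dir: str) -> list[str]:
    """Return the .txt filenames in data_dir, sorted for reproducibility."""
    return sorted(f for f in os.listdir(data_dir) if f.lower().endswith(".txt"))

# Hypothetical path -- adjust to the shared folder's actual name in your Drive.
# txt_files = list_filings("/content/drive/MyDrive/10k_filings")
```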

1 How should the 10-K corpus be constructed for industry-level analysis?

1.1 Business Goal

  • Load the raw text data from all 10-K filings
  • Create a clean, reproducible corpus that supports comparison across industries.
  • You can choose 7 different industries from the SEC filings.
  • Justify your selection and explain what you gain from it.

1.2 Technical Instructions

  1. Load and process all Form 10-K TXT files in the shared folder:
    1. Do not manually select files
    2. Your code must iterate over the directory programmatically
  2. For each filing:
    1. Load the full TXT file
    2. Extract:
      • Item 1A – Risk Factors
      • Item 7 – MD&A
    3. Extract and store:
      • company identifier
      • SIC code
  3. Build a structured corpus where each record contains:
    1. company
    2. SIC code
    3. section name
    4. cleaned text
    5. sentence-level chunks
  4. Save intermediate artifacts:
    1. cleaned section text
    2. sentence-level files
    3. metadata tables
  5. For analysis and reporting only, select:
    1. at least 10 industries
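The loading and extraction steps above can be sketched as follows. The regex section markers are assumptions for illustration only; real 10-K TXT files vary widely in formatting, so expect to refine the patterns and to parse the company name and SIC code from the SEC header:

```python
import os
import re

# Hypothetical section markers -- real filings vary, refine as needed.
SECTION_PATTERNS = {
    "Item 1A": re.compile(r"item\s+1a\.?\s+risk\s+factors(.*?)item\s+1b", re.S | re.I),
    "Item 7":  re.compile(r"item\s+7\.?\s+management.s\s+discussion(.*?)item\s+7a", re.S | re.I),
}

def extract_sections(raw_text: str) -> dict:
    """Return {section_name: text} for each section the regex can locate."""
    sections = {}
    for name, pattern in SECTION_PATTERNS.items():
        match = pattern.search(raw_text)
        if match:
            sections[name] = match.group(1).strip()
    return sections

def build_corpus(data_dir: str) -> list[dict]:
    """One record per (filing, section), iterating the directory programmatically."""
    records = []
    for fname in sorted(os.listdir(data_dir)):
        if not fname.lower().endswith(".txt"):
            continue
        with open(os.path.join(data_dir, fname), encoding="utf-8", errors="ignore") as f:
            raw = f.read()
        for section, text in extract_sections(raw).items():
            records.append({
                "company": fname,   # refine: parse the company name from the header
                "sic": None,        # refine: parse the SIC code from the header
                "section": section,
                "text": text,
            })
    return records
```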

1.3 Expected Outputs

  • Table: company → SIC → industry group → section length
  • Folder structure showing saved artifacts
  • Brief explanation of cleaning decisions

Important

All Form 10-K files in the shared folder must be processed as part of the NLP pipeline.

For Questions 2–4, you may subset the processed corpus for analysis, visualization, and modeling.

2 How should corporate risk language be represented for NLP analysis?

All text representations must be generated for the entire corpus, even if only a subset is used for downstream modeling.

2.1 Business Goal

Determine how narrative risk disclosures should be numerically represented for analysis and prediction.

2.2 Technical Instructions

  1. Build a classical NLP representation:
    • TF-IDF on Item 1A text
    • Explain tokenization and stopword choices
  2. Build a neural representation:
    • Sentence embeddings using a Hugging Face model
    • Aggregate to document-level embeddings
  3. Ensure all representations retain:
    • company identifier
    • SIC code
    • section source (Item 1A or Item 7)
  4. Compare representations:
    • dimensionality
    • sparsity
    • semantic coverage

2.3 Expected Outputs

  • Comparison table of representations
  • Short explanation of trade-offs

3 Can risk language predict industry membership?

For modeling and evaluation, you may restrict the dataset to a subset of SIC groups with sufficient sample size.

3.1 Business Goal

Evaluate whether the way firms describe risk contains enough signal to predict industry classification.

3.2 Technical Instructions

3.2.1 Prediction Task

  • Input: Item 1A risk text
  • Target: SIC-based industry group (coarse-grained)

3.2.2 Model A: Classical NLP

  • TF-IDF features

  • Logistic Regression or Linear SVM

  • Report:

    • accuracy
    • confusion matrix
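A minimal sketch of Model A with toy stand-in data; in the assignment, `X` holds Item 1A texts and `y` the coarse SIC industry groups:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in data -- replace with Item 1A texts and SIC group labels.
X = ["drug trial risk fda approval", "loan default credit risk",
     "drug patent expiry fda", "interest rate credit exposure"] * 10
y = ["pharma", "finance", "pharma", "finance"] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# TF-IDF features feeding a linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)
pred = model.predict(X_te)
print(accuracy_score(y_te, pred))
print(confusion_matrix(y_te, pred))
```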

3.2.3 Model B: Basic Neural NLP

  • Sentence/document embeddings

  • Feedforward neural network (MLP)

  • Report:

    • accuracy
    • comparison vs classical model
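Model B can be sketched with scikit-learn's `MLPClassifier` (PyTorch works equally well); the random vectors below are stand-ins for the document embeddings built in Q2:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Stand-in document embeddings -- in the assignment these come from the
# mean-pooled Hugging Face sentence embeddings built in Q2.
rng = np.random.default_rng(0)
emb_pharma = rng.normal(loc=+1.0, size=(40, 16))
emb_finance = rng.normal(loc=-1.0, size=(40, 16))
X = np.vstack([emb_pharma, emb_finance])
y = np.array(["pharma"] * 40 + ["finance"] * 40)

# Simple feedforward network (one hidden layer) over the dense embeddings.
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
mlp.fit(X, y)
print(mlp.score(X, y))   # training accuracy; use a held-out split in practice
```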

3.2.4 Analysis

  • Compare model performance
  • Identify industries that are frequently misclassified
  • Discuss interpretability vs performance

3.3 Expected Outputs

  • Model performance table
  • Confusion matrices
  • Short technical discussion

4 How reliable are NLP-based risk assessments for business decision-making?

4.1 Business Goal

Assess whether these models are decision-ready for real investment use.

4.2 Technical Instructions

  1. Identify at least three failure cases, such as:
    • boilerplate language
    • ambiguous phrasing
    • sentiment mismatch
  2. Perform manual validation:
    • inspect representative sentences
    • explain why models failed
  3. Compare:
    • classical vs neural models
    • strengths and weaknesses
  4. Write a brief executive summary:
    • key findings
    • limitations
    • recommendations
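A small helper like the following can surface failure cases for manual review; it assumes a fitted model from Q3 plus held-out texts and true industry labels:

```python
def failure_cases(model, texts, labels, n=3):
    """Return up to n (text, true_label, predicted_label) triples the model got wrong."""
    preds = model.predict(texts)
    wrong = [(t, yt, yp) for t, yt, yp in zip(texts, labels, preds) if yt != yp]
    return wrong[:n]
```

Inspecting the returned sentences by hand is what lets you attribute errors to boilerplate, ambiguity, or sentiment mismatch.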

4.3 Expected Outputs

  • Failure case table
  • Executive-style memo

Use of Generative AI

Important

Generative AI tools may be used for:

  • code assistance
  • explanation
  • summarization

They may not replace required pipeline steps.

You must briefly document:

  • where AI tools were used
  • how outputs were validated or modified

Looking Ahead

Note

In Assignments 2–5, you will:

  • replace classical models with large language models
  • build retrieval-augmented generation (RAG) pipelines
  • use vector search and neural retrieval

The cleaned corpus, SIC labels, sentence chunks, and embeddings created here will be reused directly.


Resources

  • SEC EDGAR Filings Guide
  • Hugging Face Transformers Documentation
  • scikit-learn NLP Documentation
  • PyTorch Documentation